The core underlying principle of a dynamically reconfigurable processor is the optimum
reutilization of computing resources. These computing resources can vary from a
complex application-specific form to a relatively simple version for general-purpose
computing. A necessary form of communication must exist between the processing
engines (modules) and the memory unit(s). If the data processing is shared between
multiple execution modules, then a secondary form of communication amongst them is
necessary. This gives rise to two important issues:

1) Complexity and processing power of the execution modules 

2) The burden of coordinating these modules to work as an integral unit 

The more complex the individual modules, the less communication is needed amongst
them. But this implies decreased flexibility in mapping different algorithmic
operations onto a module. If these modules are to be derived from a given class of
applications or algorithms that will be ported onto the processor, then there is an increase
in the complexity of identifying such modules. The benefit, though, is faster execution
times.

The second issue deals with the predictability of communication between certain sets of
modules and other sets (consumers and producers). The more predictable the
communication, the easier the problem of scheduling the tasks amongst them.

In this report, we present a technique for resolving both of these issues with close to
optimal solutions and compare it with existing methods or theories.

Identifying the optimal segments of an algorithm or application for clusterizing. 
In [Dasu] we described a process of narrowing down the search region for identifying
clusters. A further analysis showed that a basic block can consist of several zones. This is
seen in Figure 1, where a small part of the application code of MPEG-4 visual decoding is
thoroughly parallelized. What we see is a group of DAGs. The cyclic nature of some
graphs has been eliminated by pre-computing the iteration counts (shown in orange
circles). It should be noted that exclusion of branch operations in the graphs might result
in a large number of graphs. The problem of trying to find clusters has been addressed in
the field of music analysis and composition. [Emilios] formulates the problem as:

Conceptually we take a similar approach in associating multiple dimensions with a cluster.
In order to scientifically determine the amount of parallelization required and the number
of resources necessary to perform the set of tasks, we propose the following:

1) Identify those zones with the largest number of predeterminable iterations, or lying
in the largest number of loop nestings. Information obtained from running
benchmarks can be used in this phase.

2) Populate a table whose row and column header entries are these zones. Along the
columns, left to right: most iterative to least iterative. Along the rows, top to
bottom: most iterative to least iterative. There can be 3 types of entries in the
table: (i) A smiling face indicates a perfect match while migrating from a source
zone to a destination zone or vice versa. (ii) A sad face indicates a partial match.
(iii) A blank indicates no match.

The technique of matching such source and destination zones can be performed
through tree matching or graph matching. Some researchers have adopted the
simpler tree matching method. This is applicable if the source and destination
zones are trees. Obviously this severely narrows down the types of applications.
For a more generic, application-agnostic approach, this reduces to a graph
matching problem. There have been two notable approaches in this area of
research: (i) graph matching as a graduated assignment problem, and
(ii) graph matching by the growing nodes method.

Graduated Assignment method [Anand]: 

In this approach, a match between 2 graphs is obtained by formulating the
differences between the weighted graphs as an objective function. The authors
then try to minimize this objective function:

$$E_{wg}(M) = -\frac{1}{2} \sum_{a=1}^{A} \sum_{i=1}^{I} \sum_{b=1}^{A} \sum_{j=1}^{I} M_{ai} M_{bj} C_{aibj}$$

subject to

$$\forall a \; \sum_{i=1}^{I} M_{ai} \le 1, \qquad \forall i \; \sum_{a=1}^{A} M_{ai} \le 1, \qquad \forall a, i \; M_{ai} \in \{0, 1\}$$

Here A and I are the numbers of nodes in the 2 graphs, and M_ai and M_bj are
entries of the same matrix (called the matching matrix). The C term is
the difference in the weights of the 2 edges being compared. The summation basically
is a combination of every possible edge comparison between the 2 graphs. So, if
there is an exact match between 2 edges then the product of the Ms and C will be
1, else it will be 0. Hence the minimum value of the summation represents the
maximum number of edge matches between the 2 graphs.
Since a node in a graph can only match up with one node in the other graph, the
match matrix should be a permutation matrix.

In classical combinatorial problems, the assignment problem corresponds to finding a
permutation matrix for a given sample matrix such that the summation of the
chosen elements (an element in the sample matrix is chosen if its corresponding
entry in the permutation matrix is 1) is maximum. The authors convert the
given minimization problem into a maximization problem by expanding the
objective function using a Taylor series. They then convert the discrete
problem into a continuous version by using a control parameter. This is done by
producing an initial match matrix, obtained by exponentiating the error function.
This is then subjected to iterative row and column normalization, which should
result in a doubly stochastic matrix (Sinkhorn's rule). Sinkhorn's rule states that
any positive matrix, when iteratively normalized along rows and then along
columns, will converge to a doubly stochastic matrix. A doubly stochastic matrix
is one with non-negative entries in which every row and every column sums to
one. The newly obtained matrix is again reused in the objective function
with a newly increased parameter value.
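
The row and column normalization step can be illustrated with a minimal sketch. It is not the implementation of [Anand]; the function names, the sample compatibility values and the control parameter `beta` are assumptions chosen only for illustration.

```python
import math

def initial_match(compat, beta):
    """Exponentiate each entry with control parameter beta (assumed form)."""
    return [[math.exp(beta * q) for q in row] for row in compat]

def sinkhorn(matrix, iterations=100, tol=1e-9):
    """Alternately normalize rows and columns of a square positive matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row normalization: every row is scaled to sum to 1.
        for r in range(n):
            s = sum(m[r])
            m[r] = [v / s for v in m[r]]
        # Column normalization: every column is scaled to sum to 1.
        for c in range(n):
            s = sum(m[r][c] for r in range(n))
            for r in range(n):
                m[r][c] /= s
        # Stop once the rows also sum to 1 within the tolerance.
        if all(abs(sum(row) - 1.0) < tol for row in m):
            break
    return m

# Hypothetical 3x3 compatibility values between nodes of two graphs.
compat = [[1.0, 0.2, 0.1],
          [0.3, 0.9, 0.2],
          [0.1, 0.4, 0.8]]
soft = sinkhorn(initial_match(compat, beta=1.0))
```

Raising `beta` between rounds would push the result closer to a permutation matrix, which is the role the text assigns to the increasing control parameter.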

The complexity of this approach is O(lm), where l and m are the number of edges
in the 2 candidate graphs.
corresponding vertices, weighted value of the edge in terms of bit precision and
data type. In our problem of graph comparison, the y-axis represents the
operations and the x-axis represents the cycle of execution. Hence it is possible
for a shifted version of a candidate graph to have a completely different canonical
label, thus completely missing the match. Therefore, flexibility for a match along
the time/clock cycle axis is necessary. We also show that the number of bits
required to represent the matrix and the canonical label (MDL encoding) is far
less than in contemporary methods. These concepts will be illustrated by an
example. Consider the following graphs in Figure 1 (1 - black, 2 - black
with dotted green, and 3 - blue).




Figure 1 Graphs to illustrate commonality concept (x-axis: cycles 1 to 7)
To start with, if there exist multiple nodes at the same (x,y) location, then they are
merged into a single node and this information is noted as a part of the weight of the
node. If two edges are merged, then that is noted down as a part of the weight of the
edge, so that the original graph can be reconstructed. A graph is represented in an
adjacency matrix whose rows are numbered as (1-1, 1-2, 1-3, 1-4, 2-1, 2-2, ...)
and so are the columns. The first number (before the hyphen) denotes the cycle
and the second number (after the hyphen) denotes the operation. The order in
which the non-zero elements of the matrix are read out does not affect the
matching process, because the nodes are encoded with respect to fixed x and y
axes, and we use string comparison techniques. If we use a suffix automaton, it is
possible to detect the largest common sub-graph between graphs 1 and 2. But to
obtain the LCS between 1 or 2 and 3, a 'Shift table' is built as follows:
The entries on the columns are the edges of graph 3 and the entries along the rows are
the edges of graph 1. The intersections are marked by 2 digits that indicate the difference
in the start cycles of the edges and the duration of the edge. Only edges with the
same source and destination operations and the same duration can be compared. The
x entries imply that the source and destination operations are the same but the
durations are different. Now the entries with the same operation combination and duration
are counted. This results in Count of (+1,1) = 6; Count of (+2,1) = 2 and
Count of (-3,1) = 1. Therefore if graph 1 is shifted by +1 units, there is a guaranteed
6 edge match. This only gives a match number; the matching itself is
performed through the super alphabet string matching technique.
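
The counting step of the shift table can be sketched as follows. This is an illustrative sketch under stated assumptions: edges are written as (source clock, source operation, destination clock, destination operation) tuples, the shift is taken as the column edge's start cycle minus the row edge's start cycle, and the sample edges are hypothetical rather than taken from Figure 1.

```python
from collections import Counter

def shift_votes(row_edges, col_edges):
    """Count (shift, duration) entries of the shift table.

    Only edges with the same source and destination operations and the
    same duration are counted; all other intersections are skipped.
    """
    votes = Counter()
    for sc_r, so_r, dc_r, do_r in row_edges:
        for sc_c, so_c, dc_c, do_c in col_edges:
            if (so_r, do_r) != (so_c, do_c):
                continue  # blank entry: different operations
            dur_r, dur_c = dc_r - sc_r, dc_c - sc_c
            if dur_r != dur_c:
                continue  # 'x' entry: same operations, different durations
            votes[(sc_c - sc_r, dur_r)] += 1
    return votes

# Hypothetical edge lists: (source clock, source op, dest clock, dest op).
graph_rows = [(1, 1, 2, 2), (2, 2, 3, 3), (3, 3, 4, 1)]
graph_cols = [(2, 1, 3, 2), (3, 2, 4, 3), (4, 3, 5, 1), (6, 1, 7, 2)]

votes = shift_votes(graph_rows, graph_cols)
(best_shift, best_duration), best_count = votes.most_common(1)[0]
```

The most frequent entry gives the shift to apply before the string matching step; the count itself is only the match number, as noted above.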
The MDL encoding of the graph can be performed with less than 50% of the
number of bits required to encode the same using the method by [UTA]. The

Node growing method: 

[3] proposed this method for graph searches in databases. This involves the
formation of a pool of 2-node graphs. The candidates for the nodes in this pool are
all possible nodes from the database of graphs to be searched. Each of these small
graphs is compared with every candidate graph in the database. The bottom x%
of the matches are then pruned. The remaining 2-node graphs are now grown into
3-node graphs. Every possibility is grown. Then the process of comparison is
repeated and pruning carried out. The match between 2 graphs is done through
comparison of the canonical labels of their adjacency matrices. Various
canonicalization functions can be used to derive the labels. [George] uses the
longest possible label. But the drawback of this approach is the lack of weights on
the edges. Even without weights, complex partitioning schemes are applied to the
adjacency matrices to obtain labels for comparison.

Several other universities [Delft], [UCLA] have also adopted this approach
without the use of graph comparison through canonical labels.

Both of these approaches suffer from several disadvantages. The graduated
assignment method, being iterative, gives no indication as to how fast the
objective function converges. It is also an approximation approach since it
involves maximization of an error term. The growing nodes method (which was
also proposed in the qualifying exam) is very slow and cumbersome.
Therefore the following method is proposed, which has the least complexity and is
hence very fast.

Graph through suffix automaton: 

It is well known that a suffix automaton is the fastest way, O(n), to compare two
strings. Prior to approaching the graph comparison problem with a suffix
automaton, we had approached the comparison through the Largest Common
Subsequence algorithm. Although slower, it has the benefit of finding matches
between non-contiguous edges in the graphs. The edges need to be interpreted as
letters of the alphabet in the string. The edges are represented as a tuple of elements such as
process involves coding of the edges in the matrix through a system of slope
calculations. We start by examining the adjacency matrix. In our example, for
every source node we need only 5 bits (3 to indicate the cycle and 2 to identify the
position in the cycle). Therefore the number of bits required to represent all
source nodes = 7 * 5 = 35. For every source node, there follow pairs of numbers.
These are the slope and duration, i.e. Source Node -> Slope1, Duration1, Slope2,
Duration2, ... The number of such pairs for every source node is an indication
of the fan out. So a list of fan out numbers is generated for a graph. In our
example that list would be [1,2,3,1,1,1,1]. Therefore the number of bits required
to encode this sequence is 2 bits / symbol * 7 symbols = 14 bits. The node
sequence would be [11,0,1; 12,-1,1,1,1; 21,0,1,1,2,2,1; 23,0,1; 31,1,1; 33,-1,1;
42,0,1].

Now the number of bits required to represent each slope is log2 (number of unique
slopes), and in our example that is = 3. Therefore the number of bits to represent
all slopes = 3 * 10 = 30.

The number of bits needed to represent each length is log2 (number of unique
lengths), and in our example that is = 1. Therefore the number of bits to represent
all lengths = 1 * 10 = 10.
Hence the total number of bits required to represent the entire graph = 35 + 14 +
30 + 10 = 89 bits.

In the [UTA] method we would need about 180 bits for the same encoding. Hence
our method adheres to the MDL requirement.
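
The bit count above can be reproduced with a short calculation. The function name and argument names are illustrative assumptions; the field widths (5, 2, 3 and 1 bits) and the fan-out list are the ones quoted in the example.

```python
def mdl_bits(bits_per_source, fan_outs, bits_per_fan_out,
             bits_per_slope, bits_per_duration):
    """Total bits for the source-node, fan-out, slope and duration fields."""
    num_sources = len(fan_outs)   # one fan-out entry per source node
    num_edges = sum(fan_outs)     # every fan-out is one (slope, duration) pair
    source_bits = num_sources * bits_per_source
    fan_out_bits = num_sources * bits_per_fan_out
    slope_bits = num_edges * bits_per_slope
    duration_bits = num_edges * bits_per_duration
    return source_bits + fan_out_bits + slope_bits + duration_bits

# Values from the example: 7 source nodes, 10 edges.
total_bits = mdl_bits(bits_per_source=5,
                      fan_outs=[1, 2, 3, 1, 1, 1, 1],
                      bits_per_fan_out=2,
                      bits_per_slope=3,
                      bits_per_duration=1)
```

With these inputs the four terms are 35, 14, 30 and 10, matching the 89-bit total in the text.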

Our method of finding the MDL seems quite close to simply encoding the edges 
as a pair of vertexes (direct encoding). So we explored the option of encoding an 
edge in terms of its Source Vertex, Slope and Clock Duration. We will explain the 
process and the overheads involved and then provide an equation, which will 
indicate any savings in bits compared to direct encoding. 

For edges sharing the same Source Vertex, we will represent that group of edges
with the Source Vertex followed by a Slope and CD for every fan out Destination
Vertex. The corresponding fan out number is also stored. The format therefore is:
FO: SV, Slope1, CD1, Slope2, CD2, ...
And the savings in bits =
(# SVertexes) * {log2(Max FO) + log2(# Ops + # Clocks)}
+ (# of Edges) * {log2(Max Slope) + log2(Max CD)}
- 2 * (# of Edges) * log2(# Ops + # Clocks)

This also needs to be compared to Run Length Coding, which is quite useful in
encoding long sequences of numbers with a large number of zeros.
However, the optimal method clearly depends on the adjacency matrix,
which cannot be predicted for all classes of applications.

Although the process of finding the largest common sub-graph through a Suffix
Automaton is of low complexity, the preparation of such a 'Shift Table'
involves an all-element comparison which, if both matrices have an equal number of
edges (n), will be of complexity O(n^2). Moreover, SA can only be used for
exact (complete) sub-graph matching. Accommodation of partial matches can
only be performed using the Largest Common Subsequence algorithm, which is
of complexity order O(n^2). Therefore, to reduce the complexity of the approach, we
modified the population technique of the adjacency matrix. When an element of
the matrix is populated, the tag count of the row (source) index is incremented.
Initially all row index tags are assumed to be 0. The stored column index is
subtracted from the current column index and the difference (jump) is stored. The
stored column index is now replaced by the current column index. Therefore for
every tagged row index, there are 3 pieces of information: (i) tag count, (ii) array
of jumps, (iii) stored column index. This process helps in quickly scanning only
the necessary populated (non-zero) elements of the adjacency matrix. We do
that by checking for the first row that is tagged. Then, using the jump array, we
pick up the next element, decrement the tag count and proceed till the tag count is
zero. Then we check for the next tagged row.
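
The tagging scheme can be sketched as a small data structure. The class and method names are illustrative, and the sketch assumes that the stored column index of every row starts at 0, which the text does not state explicitly.

```python
class TaggedRows:
    """Per-row tag count, jump array and stored column index."""

    def __init__(self):
        self.tag_count = {}    # row -> number of populated elements
        self.jumps = {}        # row -> column differences, in insertion order
        self.stored_col = {}   # row -> most recently populated column

    def populate(self, row, col):
        """Record that element (row, col) of the adjacency matrix is non-zero."""
        self.tag_count[row] = self.tag_count.get(row, 0) + 1
        last = self.stored_col.get(row, 0)   # assumed initial value: 0
        self.jumps.setdefault(row, []).append(col - last)
        self.stored_col[row] = col

    def scan(self):
        """Return the populated (row, col) pairs without visiting empty cells."""
        found = []
        for row in sorted(self.tag_count):
            remaining = self.tag_count[row]
            col, k = 0, 0
            while remaining > 0:
                col += self.jumps[row][k]    # follow the next jump
                found.append((row, col))
                k += 1
                remaining -= 1               # decrement the tag count
        return found

matrix = TaggedRows()
for r, c in [(1, 3), (1, 5), (2, 2), (1, 4)]:
    matrix.populate(r, c)
populated = matrix.scan()
```

The scan works on a copy of the tag count, so the structure can be scanned more than once; that is a convenience of the sketch rather than something the text requires.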

The string of edges (elements) obtained this way will now be sorted using an
efficient algorithm such as Merge Sort, whose complexity is much lower than
O(n^2). An edge consists of 4 basic elements (Source Clock, Source Operation,
Destination Clock, Destination Operation). The sorting criteria will be: same SO-DO
combination starting from the lowest SO; if the SOs are the same and the DOs are
different, then the lower DO takes priority. If the SOs are different, the lower SO takes
priority. Then once a sorted list is obtained, a set of m^2 bins is created, where m
is the number of operations. The bins are arranged in a queue with the same
priority criteria as used for the sorting algorithm. For example, if we had 4
operations (1, 2, 3 and 4), then the queue of bins would be
(11, 12, 13, 14, 21, 22, 23, 24, ...), where 12 means SO = 1 and DO = 2. Now we
need to place the edges into the appropriate bins. This is done by comparing an
element's (edge's) SODO pair with the SODO of the head of the queue. If a match
occurs, the element is placed into that bin. If a mismatch occurs, the
head of the queue is shifted to the right by one position and the SODO
comparison is made again, and the queue head is shifted till a match is met. Once all
elements are placed in the bins, a similar process is followed for the other
graph to be compared, and a second bin queue is obtained.
To find the largest common sub-graph, we only need to operate on
corresponding pairs of bins (bins with the same SODO) from both queues, where
both bins have elements in them. For example, a sample bin pair queue is shown in
Figure 2.





Figure 2 Bin pair queue
We perform an LCS algorithm on the chosen bin pairs (for example, on bin pairs 13
and 13). Although LCS has O(n^2) complexity, the number of elements being
compared is much lower than the whole set of edges. But care must be taken to
arrange the edges in each bin in the order of increasing Source Clock cycles. This
will simulate an alphabet sequence such as 'abcaaba...' where the repeated
letters are shifted versions of the first occurrence.

The resulting set of Largest Common Subsequences from the chosen bin pairs is
collectively called the Largest Common Sub-graph.
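
The binning and per-bin LCS steps can be sketched together. Several choices here are assumptions made for illustration: a dictionary keyed by (SO, DO) stands in for the queue of m^2 bins, each edge is reduced to its duration as the symbol that LCS compares, and the sample edges are hypothetical.

```python
from collections import defaultdict

def bin_edges(edges):
    """Group edges by (source op, dest op), each bin ordered by source clock."""
    bins = defaultdict(list)
    for sc, so, dc, do in sorted(edges, key=lambda e: (e[1], e[3], e[0])):
        bins[(so, do)].append((sc, dc))
    return bins

def lcs(a, b):
    """Classic O(len(a) * len(b)) longest common subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def durations(edge_bin):
    """Assumed symbol for an edge: destination clock minus source clock."""
    return [dc - sc for sc, dc in edge_bin]

def common_subgraph(edges_a, edges_b):
    """Run LCS only on bin pairs that are non-empty in both graphs."""
    bins_a, bins_b = bin_edges(edges_a), bin_edges(edges_b)
    result = {}
    for key in sorted(bins_a.keys() & bins_b.keys()):
        match = lcs(durations(bins_a[key]), durations(bins_b[key]))
        if match:
            result[key] = match
    return result

# Hypothetical edges: (source clock, source op, dest clock, dest op).
edges_a = [(1, 1, 2, 3), (2, 1, 4, 3), (3, 2, 4, 3), (5, 1, 6, 3)]
edges_b = [(2, 1, 3, 3), (4, 1, 5, 3), (1, 2, 3, 3)]
shared = common_subgraph(edges_a, edges_b)
```

Because each LCS call only sees the edges of one bin pair, the quadratic cost applies to the bin sizes and not to the full edge list.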

We confidently believe that for our application, this method is the fastest and
least complex approach to finding the largest common sub-graph.

Constraints of reconfiguration distance can be applied at the bin choosing step.

3) A probability table of source to destination jumps is built. This helps in pruning
out zones which are less likely to occur and have low matching with other zones.

4) The zones are now designed in hardware such that the edit distance between
zones portable on a module involves a minimum amount of logic switching.

Task Scheduling

We now address the second issue of task scheduling. In the graph matching problem,
we can include branch operations to reduce the number of graphs. This can be done
if one of the paths of a branch operation leads to a very large graph compared to the
other path, or is a subset of the other path. This still leaves us with the problem of
conditional task scheduling with loops involved. Several researchers have addressed
task scheduling and one group has also addressed loop scheduling with conditional
tasks. In this report we will not discuss all the relevant work done by other research
groups. Instead we focus on some important ones, propose a method most suitable
for the purpose of reconfiguration and compare it with the contemporary methods.
[Chekuri's] paper discusses the earliest branch node retirement scheme. This is
applicable for trees and s-graphs. An s-graph is a graph where only one path has
weighted nodes. In this case, it is a collection of DAGs representing basic blocks
which all end in branch nodes, and the options at the branch nodes are: exit from the
whole graph or exit to another branch node, as shown in Figure 3. The two DAGs are
in blue and violet and the branch nodes are in red, with the probabilities of exiting the
system being p and q. Such a graph is scheduled on a generic set of processing
elements such that the branch node with the least ratio of schedule length to sum of exit
probabilities from the system (up to that branch node) is scheduled as early as
possible. The denominator implies how many sub-graphs can be completed. The
numerator implies 'how fast this can be done' or 'what is the least amount of time it
takes to do it'. Therefore, a low rank indicates that a large number of sub-graphs can
be completed in a small amount of time. This is a form of list based scheduling where
they try to minimize the expected time of completion.
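
The ranking rule can be sketched as below. This is only an illustration of the ratio described above, not the algorithm of the cited paper; the tuple layout and the sample numbers are hypothetical.

```python
def retirement_order(branch_nodes):
    """Order branch nodes by rank = schedule length / summed exit probability.

    branch_nodes: list of (name, schedule_length, exit_probability_sum),
    where the last field is the sum of exit probabilities up to that node.
    A lower rank is scheduled earlier.
    """
    ranked = []
    for name, length, prob_sum in branch_nodes:
        if prob_sum <= 0:
            continue  # a node that never exits cannot be ranked by this ratio
        ranked.append((length / prob_sum, name))
    return [name for _, name in sorted(ranked)]

# Hypothetical branch nodes with schedule lengths and exit probability sums.
nodes = [("B1", 4.0, 0.5), ("B2", 6.0, 0.9), ("B3", 3.0, 0.2)]
order = retirement_order(nodes)
```

Here B2 has the lowest rank (about 6.7), so it would be retired first even though its schedule length is the longest of the three.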




Figure 3 Early retirement schedule 



The problem with this approach is that it is applicable only to 'small' graphs, as defined
by the authors, and is also restricted to s-graphs and trees. It also does not consider nodes
mapped to specific processing elements.

[Jha's] paper addresses scheduling of loops with conditional paths inside them. This is a
good approach as it exploits parallelism to a large extent and uses loop unrolling. But the
drawback is that the control mechanism for having a knowledge of 'which iteration's data
is being processed by which resource' is very complicated. This is useful for one or two
levels of loop unrolling. It is quite useful where the processing units can afford to
communicate quite often with each other and the scheduler. But in our case, the network
occupies about 70% of the chip area [Andre] and hence the processing units cannot afford
to communicate with each other too often. Moreover, the granularity level of operation
between processing elements is beyond a basic block level and hence this method is not
practical. And within a processing element, since the reconfiguration distance (edit
distance) is more important, fine scale scheduling is compromised because the benefits of
using very fine grain processing units are lost due to high configuration load time.

[Mooney's] paper discusses a 'path based edge activation' scheme. This basically means
that if, for a group of nodes (which must be scheduled onto the same processing unit and
whose schedules are affected by branch paths occurring at a later stage), we know ahead
of time the branch controlling values, then we can at run time prepare all possible
optimized list schedules, for every possible set of branch controller values. In the
simple example shown in Figure 4, the nodes in gray need to be scheduled on
the same processing unit. The branch controlling variable is b, which can take values of 0
or 1. In case it takes a 0, the branch path in red is taken, else the path in green is taken. In
the case where we can know the value of b at run time, yet ahead of the time of occurrence
of the branch paths, we can prepare schedules for the 3 gray nodes and launch either
one the moment b's value is known.




Figure 4 Path based edge activation

This method is very similar to the partial critical path based method proposed by [Pop]. It
involves the use of a hardware scheduler and is quite well suited for our application. But
we need to add another constraint to the scheduling: the amount of reconfiguration or the
edit distance.

[Pop's] paper tackles control task scheduling in 2 ways. The first is partial critical path
based scheduling, which is discussed above, although they do not assume that the value
of the conditional controller is known prior to the evaluation of the branch operation.
They also propose the use of a branch and bound technique for finding a schedule for
every possible branch outcome. This is quite exhaustive, but it provides an optimal
schedule. Once all possible schedules have been obtained, the schedules are merged.
The advantage is that it is optimal, but it has the drawback of being quite complex. It
also does not consider loop structures.

We propose to utilize a combination of these methods with modifications to suit our
needs of reconfiguration. At the coarse level of scheduling, we will exploit all those
clusters whose iteration counts are known at compile time. Loop unrolling is performed
to the extent that the edit distance to the next possible configuration on the same module
is not disturbed to a large extent. For scheduling clusters on the same module, if the
branch controlling conditions are known at run time, but prior to the branch execution,
we will employ [Mooney's] method. For all other possibilities we will use the branch and
bound technique, but with 2 additional constraints. One is that of reconfiguration distance,
which will affect the bounding, and the other is for loops whose lifetimes are not known
at compile time. All dependent nodes (clusters) are in a wait state till the loop is
completed. This indicates a break point in the schedule chart.
Therefore in our approach the factors being considered are:
• data dependency

• execution time (can be minimized by parallelizing and a faster clock)

• reconfiguration time (must be minimized as it is a significant factor)

• communication time (in the more likely cases of branch prediction, it should not be a
problem, but in the unlikely cases it might be significant)

Task scheduling for control data flow graphs 

Problem statement: 

Given a control-data flow graph, we need to arrive at an optimal schedule. Sections 1 and
2 introduce the CDFG and the resources. Section 3 discusses the methodology to arrive at
the optimal schedule. This includes PCP scheduling and improvisations for specific
situations. It also talks about the technique used for merging the various schedules.
Section 4 discusses reconfiguration. Section 5 talks about issues involved with loops.

1. Control-Data Flow Graph:

A directed cyclic graph has been used to model the entire application. It is a polar
graph with both source and sink nodes. The graph can be denoted by G(V, E). V is
the list of all processes that need to be scheduled. E is the list of all possible
interactions between the processes. The processes can be of three types: data,
communication and reconfiguration. The edges can be of three types: unconditional,
conditional and reconfiguration. Here a simple example with no reconfiguration and
no loops is shown in Figure 5. Reconfiguration will be dealt with in section 4.
Loops will be handled in section 5.
Figure 5 An Example of a Control Data Flow Graph

In the above graph, each of the circles represents a process. Sufficient resources are
assumed for communication purposes. All the processes have an execution time
associated with them, which is shown alongside each circle. If any process is a
control-based process, then the various values to which the condition evaluates are
shown on the edges emanating from that process circle.

2. Resources:

Let the number of resources allocated for processes of type PEi be Ni. In Figure 5, the
following configuration has been assumed. There are three processors PE1, PE2 and
PE3. N1 = N2 = N3 = 1, one for each type of process.
Determination of number of resources for each type of process:
Our approach:

Obtain the sub-graphs for every possible path. For example, the graph in Figure 5 is
used to obtain the path for DCK (sub-graph shown in Figure 5-a).

Figure 5-a Sub-graph for condition DCK (legend: source and sink nodes, process type 1, process type 2, process type 3)

For this particular example graph, 6 such sub-graphs are obtained, one for each of the
conditions: DCK, DCK, DCK, DCK, DC, DC.
For any sub-graph, we can determine the number of units of each type of processor.
This is done by isolating the nodes corresponding to a processor type. For example, in
the DCK example taken above, 3 more graphs can be obtained as shown in Figures 6
and 7. It might not make sense to apply this policy to all 6 sub-graphs. Therefore
only those deemed most likely to be taken should be considered. In case an
unlikely path does end up being taken, the clock speed for the general purpose
computing resources (the programmable LUTs) must be designed suitably if real-time
requirements exist.

Figure 6 Graphs obtained for individual process types 1 and 2 (legend: process type 1 versus process NOT type 1; process type 2 versus process NOT type 2)

Figure 7 Graphs obtained for individual process type 3 (legend: process type 3 versus process NOT type 3)

For a graph with only a specific process type highlighted, it is possible to obtain the
number of processor units by identifying the critical path and separating the graph
into processes in the critical path and those outside it. For example, if we had a graph
as shown in Figure 8, with the critical path marked in dotted line arrows, we would
group processes P1, P5, P8, P9, P10, P12 and P13 into the primary group. And we
would place all the other processes into another group called the secondary group. If
the combined execution time in the primary group is, say, Tp and the combined
execution time in the secondary group is Ts, then we check the ratio Tp : Ts. If
the ratio is close to 1:1, then it means that, most likely, maximum benefit can be
obtained by scheduling the two groups onto 2 parallel processors. If the ratio is
1:x where x > 1, then in the secondary group, a critical path is identified. Thus the
secondary group is similarly split into 2 groups. We proceed in this divide and
conquer method till a 1:1:1... or a close ratio is obtained. But if Tp : Ts is x : 1, then
there would be an underutilization of resources if additional processing units are
allocated. In this case we might be better off using a single resource allocation.




Figure 8 Divide and Conquer method of determining number of parallel units 
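
The divide and conquer rule can be sketched as a recursive estimate. The threshold `close` that decides when a ratio counts as "close to 1:1", the function names and the sample graphs are all assumptions made for illustration; the text does not fix any of them.

```python
def critical_path(nodes, succ, time):
    """Longest path (by total execution time) inside the node set of a DAG."""
    nodes = set(nodes)
    best = {}  # node -> (longest time starting here, next node on that path)

    def longest(n):
        if n not in best:
            total, nxt = time[n], None
            for s in succ.get(n, ()):
                if s in nodes:
                    cand = time[n] + longest(s)[0]
                    if cand > total:
                        total, nxt = cand, s
            best[n] = (total, nxt)
        return best[n]

    node = max(sorted(nodes), key=lambda n: longest(n)[0])
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path

def estimate_units(nodes, succ, time, close=1.5):
    """Estimate the number of parallel units for one process type."""
    nodes = set(nodes)
    if not nodes:
        return 0
    primary = set(critical_path(nodes, succ, time))
    secondary = nodes - primary
    tp = sum(time[n] for n in primary)
    ts = sum(time[n] for n in secondary)
    if ts == 0 or tp / ts > close:
        return 1   # x : 1, extra units would be underutilized
    if ts / tp <= close:
        return 2   # close to 1 : 1, one unit per group
    # 1 : x, keep one unit for the primary group and split the rest again
    return 1 + estimate_units(secondary, succ, time, close)

# Hypothetical graphs: successor lists and execution times per process.
succ_chain = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": ["P4"], "P4": []}
time_chain = {"P1": 2, "P2": 5, "P3": 4, "P4": 1}
succ_wide = {"S": ["A", "B", "C", "D"], "A": ["T"], "B": ["T"],
             "C": ["T"], "D": ["T"], "T": []}
time_wide = {"S": 1, "A": 4, "B": 4, "C": 4, "D": 4, "T": 1}

units_chain = estimate_units(succ_chain, succ_chain, time_chain)
units_wide = estimate_units(succ_wide, succ_wide, time_wide)
```

In the first graph the critical path dominates (8 versus 4 time units), so a single unit is kept; in the second, four equally long parallel branches lead to four units.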

3. Methodology:

i. Use PCP scheduling to determine the delays for each possible path of the
CDFG and arrange the list of paths in descending order of the delays.
ii. Perform branch and bound based scheduling (which need not be done for
every path, to reduce the complexity).
iii. Once the final list of all schedules is ready, merge all the schedules by
respecting data and resource dependencies.

. PCP scheduling : 

PCP is a modified list-based scheduling algorithm. The basic concept in a .Partial 
critical path based scheduling algorithm is that if we have a situation as shown in 
Figure 9 below, where Processes' P A , P B , Px, Py *e all to be mapped onto the' 
same resource say Processor Type 1. P A and P* are in the ready list and a decision 
needs to be taken as to which will be scheduled first. X* and Xb are times of 
execution for processes in the paths of P A and P B respectively, but which are not 
allocated on the Processors of type 1 and also do not share the same type of 
resource. 




Figure 9 V CP based scheduling 
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If P A is assigned first, then the longest time of execution is decided by the 
MaxfT A + >WYTA + T B +XB) 

If P B is assigned 'first, then the longest time of execution is decided by the 
Max (T B + Xa T B + T A + Xa) • 

The best schedule is the minimum of the two quantities. This is called the partial
critical path method because it focuses on the path time of the processes beyond
those in the ready list. Therefore, if Xa is larger than Xb, a better schedule is
obtained if Process A is scheduled first. But this does not consider the
possibility of resource sharing between the processes in the path beyond those in
the ready list. A simple example (Figure 10) shows that if TA = 3, TB = 2, Xa = 7,
Xb = 5, and the processes in the Xa and Xb sections share the same resource, say
Processor type 2, then scheduling Process A first gives a time of 15 and
scheduling B first gives a time of 14. But both the critical path method and PCP
as proposed by Pop suggest scheduling A first.

Figure 10 PCP scheduling with resource dependencies in the partial path region

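The difference arises because, if the resource constraint of the post ready list
processes is considered, the best schedule is the minimum of two quantities built
on Max (TB, Xa) and Max (TA, Xb): TA + Max (TB, Xa) + Xb when PA is scheduled
first, and TB + Max (TA, Xb) + Xa when PB is scheduled first.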
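The two decision rules can be compared with a small sketch. The function names are ours, and the resource-aware variant assumes that the Xa and Xb sections run strictly one after the other on the shared processor, with the section that becomes ready first going first. It reproduces the numbers of the Figure 10 example.

```python
def pcp_lengths(ta, tb, xa, xb):
    """Schedule lengths predicted by the plain PCP rule.

    Returns (length if PA goes first, length if PB goes first).
    """
    a_first = max(ta + xa, ta + tb + xb)
    b_first = max(tb + xb, tb + ta + xa)
    return a_first, b_first


def shared_tail_lengths(ta, tb, xa, xb):
    """Schedule lengths when the Xa and Xb sections share one resource.

    Assumption (ours): the two tail sections are serialized on the shared
    processor, and the tail that becomes ready first runs first.
    """
    # PA first: Xa is ready at ta; Xb is ready at ta + tb but must also
    # wait until Xa releases the shared processor at ta + xa.
    a_first = max(ta + xa, ta + tb) + xb
    # PB first: the symmetric case.
    b_first = max(tb + xb, tb + ta) + xa
    return a_first, b_first


ta, tb, xa, xb = 3, 2, 7, 5
print(pcp_lengths(ta, tb, xa, xb))          # (10, 12): PCP prefers PA first
print(shared_tail_lengths(ta, tb, xa, xb))  # (15, 14): PB first is shorter
```

With TA = 3, TB = 2, Xa = 7 and Xb = 5 the plain rule favours PA, while the resource-aware lengths of 15 and 14 favour PB.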

Pop [Pop] uses the heuristic obtained from PCP scheduling to bound the
schedules in a typical branch and bound algorithm to get to the optimal schedule.
But branch and bound is an exponentially complex algorithm in the worst case, so
there is a need for a less complex algorithm that can produce near-optimal
schedules. From a higher-level view of scheduling, we need to limit the need for
BB scheduling as much as possible.

Initially, the control variables in the CDFG are extracted. Let c1, c2, ..., cn be
the control variables. Then there will be at most 2^n possible data-flow paths of
execution, one for each combination of these control variables, from the given
CDFG. Our ideal aim is to get the optimal schedule at compile time for each of
these paths. Since the control information is not available at compile time, we
need to arrive at an optimal solution for each path with every other path in mind.
This optimal schedule is arrived at in two stages. First the optimal individual
schedule for each path is determined. Then each of these optimal schedules is
modified with the help of the other schedules.
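As a minimal sketch of where the 2^n bound comes from, the combinations of n Boolean control variables can be enumerated as follows. The variable names and the helper `control_combinations` are illustrative only.

```python
from itertools import product


def control_combinations(ctrl_vars):
    """Yield every True/False assignment of the given control variables."""
    for values in product((True, False), repeat=len(ctrl_vars)):
        yield dict(zip(ctrl_vars, values))


combos = list(control_combinations(["c1", "c2", "c3"]))
print(len(combos))  # 8, i.e. 2**3
print(combos[0])    # {'c1': True, 'c2': True, 'c3': True}
```

The bound is an upper limit: when one variable is only evaluated under a particular value of another, several assignments collapse onto the same data-flow path.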

Stage 1: There are 2^n possible Data Flow Graphs (DFGs). For each DFG, PCP
scheduling is done. Then the DFGs are arranged in decreasing order of their
total delays.

An optimal solution can be obtained by doing branch and bound scheduling for
each of these PCP scheduled DFGs. But branch and bound is a highly complex
algorithm with exponential complexity. In this case, this complex operation would
need to be done 2^n times, where n is the number of control variables, which
increases the complexity way beyond control. Hence branch and bound is done only
when it is essential to do so. First, BB scheduling is done for DFG1, which has the
largest delay. For DFG2, the PCP delay is compared with the BB delay of DFG1.
If the PCP delay is smaller, then the PCP schedule is taken as the optimal
schedule for that path. If not, then BB scheduling is done to get the optimal
schedule. It makes sense to do this, as the final delay of each DFG after
modification is going to be close to the delay of the worst delay path. In the same
way, the optimal schedule is arrived at for each of the DFGs.
Stage 2: Once the optimal schedule is arrived at, a schedule table is initialized
with the processes on the rows and the various combinations of control variables
on the columns. A branching tree is also generated, which shows the various
control paths. This contains only the control information of the CDFG. There
exists a column in the schedule table corresponding to each path in this branching
tree. The branching tree is shown in Figure 11. The path corresponding to the
maximum delay is taken and the schedule for that path is taken as
the template (DCK'). Now the DCK path is taken and its schedule is modified
according to that of DCK'. This is done for all the paths. The final schedule table
obtained will be the table that resides on the processor.

Length of the optimal schedule for the
alternative paths through the graph in Figure 1:
DaCaK 39 
* D"aC 39 
DaCaK 38 
DaCaK 32 
DaCaK 31 
D*aU 31 

Figure 11 Branching tree 







The pseudo code of this process is summarized in Figure 12.

schedule(G(V,E), CTRL_VARS[N], PE = {PE1, PE2, ..., PEM})
{
  For each combination I of CTRL_VARS do
  {
    Generate a DFG Gsub(V,E,CTRL_VARS[I]) which is a sub-graph of G(V,E).
      Only the nodes and edges in the control flow corresponding to the
      current combination of CTRL_VARS are included in this sub-graph.
    Generate the PCP schedule of Gsub. Let the schedule be PCP_sched[I] and
      the delay be PCP_delay[I].
  }

  Sort PCP_sched, PCP_delay and Gsub in decreasing order of PCP_delay[I].

  Generate the branch and bound schedule for Gsub[0], the sub-graph with the
    maximum delay. Let the schedule be BB_sched[0] and the delay be BB_delay[0].
  Initialize worst_bb_delay = BB_delay[0].

  For all the other sub-graphs do
  {
    if (PCP_delay[I] < worst_bb_delay) then
      BB_sched[I] = PCP_sched[I];
      BB_delay[I] = PCP_delay[I];
    else
      Generate BB_sched[I] and BB_delay[I];
      if (BB_delay[I] > worst_bb_delay) then
        worst_bb_delay = BB_delay[I];
  }

  Generate the branching tree with the help of G(V,E). In this tree, each
    edge represents the choices (K and K') and each node represents a
    control variable.
  Initialize the current path to the one leading from the top to the leaf
    such that the DFG corresponding to this path gives the worst BB delay.
    The path is nothing but a list of edges tracing from the top node to the
    leaf.
}

Figure 12 Selective use of PCP and BB algorithms 
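The selection logic of Figure 12 can be sketched in runnable form. This is only an illustration under our own assumptions: `pcp_schedule` and `bb_schedule` are hypothetical callables standing in for the two schedulers, each returning a (schedule, delay) pair, and the delays in the usage example are made-up numbers.

```python
def selective_schedule(dfgs, pcp_schedule, bb_schedule):
    """Return {dfg: (schedule, delay, method)} using BB only when needed.

    dfgs         : identifiers of the sub-graphs, one per control path
    pcp_schedule : hypothetical callable, dfg -> (schedule, delay)
    bb_schedule  : hypothetical callable, dfg -> (schedule, delay)
    """
    pcp = {g: pcp_schedule(g) for g in dfgs}
    # Process the sub-graphs from the largest PCP delay to the smallest.
    order = sorted(dfgs, key=lambda g: pcp[g][1], reverse=True)

    result = {}
    first = order[0]
    sched, worst_bb_delay = bb_schedule(first)
    result[first] = (sched, worst_bb_delay, "BB")

    for g in order[1:]:
        sched, delay = pcp[g]
        if delay < worst_bb_delay:
            # The heuristic schedule already fits under the worst path.
            result[g] = (sched, delay, "PCP")
        else:
            sched, delay = bb_schedule(g)
            result[g] = (sched, delay, "BB")
            worst_bb_delay = max(worst_bb_delay, delay)
    return result


# Usage with made-up delays for three control paths.
pcp_delay = {"p1": 40, "p2": 39, "p3": 31}
bb_delay = {"p1": 39, "p2": 38, "p3": 30}
bb_calls = []


def toy_pcp(g):
    return (f"pcp-{g}", pcp_delay[g])


def toy_bb(g):
    bb_calls.append(g)
    return (f"bb-{g}", bb_delay[g])


result = selective_schedule(["p1", "p2", "p3"], toy_pcp, toy_bb)
print({g: r[2] for g, r in result.items()})
# {'p1': 'BB', 'p2': 'BB', 'p3': 'PCP'}
```

In this toy run branch and bound is invoked for two of the three paths; the third keeps its PCP schedule because its delay is already below the worst BB delay.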

We also observe that processes with large execution times have a greater impact
on the schedule than the shorter processes. Hence, we decided to schedule large
processes in a special way. The shorter processes can be scheduled using the PCP
scheduling algorithm. Since PCP scheduling is done for most of the processes, the
complexity stays closer to O(N), where N is the number of processes to be
scheduled.

a) Identify the first set of processes that need to be scheduled onto the same
processor and which are computationally complex. Let's call them MP1,
MP2, ... (Macro process 1, etc.)

b) Schedule all the processes up to these macro processes in the data flow graph
using PCP scheduling.

c) Calculate the estimated execution time of the smaller processes to find the
start time of each of the macro processes.

d) Determine the next set of such macro processes in the DFG. Let's call them
MP_sub1, MP_sub2, ...

e) For processes between these two sets of macro processes, PCP scheduling is
used.

f) For processes occurring after the second set of macro processes, the execution
times are added up to get the total execution time.

g) Now, determine the order of execution of these processes by estimating the
worst-case execution time in each case and selecting the best amongst them.

h) After this scheduling, the block after the second set of macro processes is
taken as the current DFG and steps a-g are implemented.

i) Step h is repeated till the end of the DFG is reached.

Schedule merging algorithm: 

In the schedule table there are some columns representing paths that are complete
and some that are not. The incomplete paths can now be referred to as parent
paths of possible complete paths.

In the example shown in Figure 1, we see that for earliest evaluation of all
conditional variables (viz. D, C, K) it is necessary to evaluate D first, then C and
then K. Therefore the tree of possible paths is as shown in Figure 11. Now, while
creating the schedule table, initially consider only the full possible paths, i.e., the
6 paths listed in Figure 11. Perform scheduling by the suggested algorithm. This
will fill these columns. Then create the remaining columns of partial paths (i.e., D,
D ∧ C, etc.). These are now just empty columns. Now if a process has the same
start time in multiple columns, then push it into the parent empty column. This
approach tries to obtain the worst case delay and merge all paths to that timeline.
Since the D ∧ C ∧ K(bar) path had the worst case optimal delay, all other full paths
were adjusted to match this path. But it is also necessary to consider the
probability of occurrence of all the full paths (6 of them). Then prune out the
bottom 10% of the paths, that is, disregard those full paths whose probability of
occurrence is less than a threshold value when compared to the path with the most
probable occurrence.

Then, from the remaining paths, the one whose probability of occurrence is the
highest is selected. This will be the new reference to which all the remaining paths
will adjust. Now it is likely that these chosen full paths and the disregarded full
paths share certain partial paths (parent paths). Therefore, while allocating the
start times for the processes that fall under these shared partial paths, we must
allocate them based on the worst (most delay consuming) disregarded path which
needs (shares) these processes. While performing schedule merging, all data
dependencies must be respected.
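Two of the merging steps lend themselves to a short sketch: pushing processes with identical start times into the parent column, and pruning improbable full paths before picking the reference path. The data layout (a dictionary of columns), the column names, the start times, the probabilities and the interpretation of the threshold as a ratio of the highest probability are all our assumptions for illustration.

```python
def hoist_common(columns, parent):
    """Move processes with one common start time into the parent column.

    columns maps a column name to {process: start_time}; every column other
    than `parent` is assumed to be a child path of `parent`.
    """
    children = [c for c in columns if c != parent]
    shared = set.intersection(*(set(columns[c]) for c in children))
    for proc in sorted(shared):
        starts = {columns[c][proc] for c in children}
        if len(starts) == 1:  # same start time in every child column
            columns[parent][proc] = starts.pop()
            for c in children:
                del columns[c][proc]
    return columns


def prune_and_pick(probabilities, ratio):
    """Drop paths below ratio * (highest probability); pick the reference.

    Returns (kept_paths, reference_path), where the reference is the most
    probable of the kept paths.
    """
    best = max(probabilities.values())
    kept = {p: pr for p, pr in probabilities.items() if pr >= ratio * best}
    reference = max(kept, key=kept.get)
    return kept, reference


columns = {
    "D": {},
    "D&C": {"P1": 0, "P2": 4},
    "D&!C": {"P1": 0, "P2": 6},
}
print(hoist_common(columns, "D"))
# {'D': {'P1': 0}, 'D&C': {'P2': 4}, 'D&!C': {'P2': 6}}

kept, reference = prune_and_pick({"p1": 0.5, "p2": 0.3, "p3": 0.02}, 0.1)
print(sorted(kept), reference)  # ['p1', 'p2'] p1
```

The sketch leaves out the adjustment of start times on shared partial paths to the worst disregarded path, which would need the full schedule table.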



4. Reconfiguration: 

Incorporating reconfiguration time into control flow graphs involves the
following steps:

i. Special edges are added onto the control flow graphs. These edges exist
between a similar set of processes, which will be executed on the same
processor with or without reconfiguration.

ii. Reconfiguration times affect the worst-case execution time of loopy codes, so
this has to be taken care of when loopy codes are being scheduled.

iii. Care needs to be taken to schedule the transfer of the reconfiguration
bit-stream from the main memory to the processor memory.

Before the concepts involved in loop based or influenced scheduling are explained, a
brief overview of the architecture of the reconfigurable unit must be presented.

LSM = Logic Schedule Manager, NSM = Network Schedule Manager

Figure 13 Overview of the system architecture

CLU = Configurable Logic Unit; LU = Logic Units; SN = Switching Network
CM = Configuration Memory; LSM = Logic Schedule Manager

Figure 14 The internals of the Reconfigurable Unit
The Network Schedule Manager (Figure 14) has access to a set of tables, one for each
processor. A table consists of possible tentative schedules for processes or tasks that
must be mapped onto the corresponding processor subject to evaluation of certain
conditional control variables.

The Logic Schedule Manager schedules and loads the configurations for the processes
that need to be scheduled on the corresponding processor, i.e. all processes that come
in the same column (a particular condition) in the schedule table.
In PCP scheduling, since the scheduling of the processes in the ready list depends
only on the part of the paths following those processes, the execution time of the
processes shall initially conveniently include the configuration time.
Once a particular process is scheduled and hence removed from the ready list, another
process is chosen to be scheduled based on the PCP criteria again. But this time the
execution time of that process is changed, or rather reduced, by using the
reconfiguration time instead of the configuration time. Essentially, for the first
process that is scheduled in a column,
completion time = execution time + configuration time
For the next or successive processes,
completion time = predecessor's completion time + execution time + reconfiguration time
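A minimal sketch of these two formulas, assuming the processes of one column run back to back and share a single configuration time and a single reconfiguration time (the numbers below are invented):

```python
def completion_times(exec_times, config_time, reconfig_time):
    """Completion time of each process in one schedule-table column.

    The first process pays the full configuration time; every later process
    starts when its predecessor completes and pays the reconfiguration time.
    """
    finished = []
    for i, exec_time in enumerate(exec_times):
        if i == 0:
            done = exec_time + config_time
        else:
            done = finished[-1] + exec_time + reconfig_time
        finished.append(done)
    return finished


print(completion_times([10, 6, 8], config_time=5, reconfig_time=2))
# [15, 23, 33]
```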

Assuming that, once a configuration has been loaded into the CM, the process of
putting the configuration in place is instantaneous, it is always advantageous to
load successive configurations into the CM ahead of time. This usefully hides the
latency of loading a successive configuration.

The reconfiguration time is dependent on two factors: 

1) How much configuration data needs to be loaded into the CM (Application 
dependent) 

2) How many wires are there to carry this info from the LSM to the CM 
(Architecture dependent) 
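As a rough illustration of how the two factors interact, the sketch below assumes a simple linear transfer model in which every wire carries one bit per cycle. That model, the function name and the example sizes are our assumptions and are not taken from the architecture description.

```python
import math


def reconfiguration_cycles(config_bits, wires, bits_per_wire_per_cycle=1):
    """Cycles needed to move config_bits from the LSM to the CM.

    Assumed model: all wires transfer in parallel at a fixed rate.
    """
    return math.ceil(config_bits / (wires * bits_per_wire_per_cycle))


print(reconfiguration_cycles(4096, 32))  # 128
print(reconfiguration_cycles(4100, 32))  # 129 (a partial last transfer)
```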

The Network Schedule Manager should accept control parameters from all LSMs. It
should have a set of address decoders because, to send the configuration bits to the
network fabric consisting of a variety of switch boxes, it needs to identify their
location. Therefore, for every column in the table, the NSM needs to know the route
a priori. We must NOT try to find a shortest path at run time. For a given set of
communicating processors, there should be a fixed route. If this is not done, then the
communication time of the edges in the CDFG cannot be used as constants while
scheduling the graph.

For any edge, the

communication time = a constant and uniform configuration time +
data transaction time.

5. Loop-based scheduling: 

Case 1: Solitary loops with unknown execution time. Here, the problem is that the
execution time of the process is known only after it has finished executing in the
processor. So static scheduling is not possible.
Solution:

(Assumption) Once a unit generates an output, this data is stored at the consuming
/ target unit's input buffer.

Consider the following schedule chart (Figure 15). Each row represents
processes scheduled on a unique type of unit (processor). Let P1 be the loopy
process.

Figure 15 Scheduled process charts with resource and data dependency

From the above figure we see that
P3 depends on P1 and P4;
P2 depends on P1;
P6 depends on P2 and P5.

If P1's lifetime exceeds the assumed lifetime (most probable lifetime), then all
dependents of P1 and their dependents (both resource and data) should be notified
and the respective NSM and LSM entries delayed. Of course, this implies that
while preparing the schedule tables, 2 assumptions are made.
1) The lifetimes of solitary loops with unknown execution times are taken as per
the most probable case obtained from prior trace-file statistics;
2) All processes that are dependent on such solitary loop processes are scheduled
with a small buffer at their start times. This is to provide time for notification
through communication channels about any deviation from assumption 1 at
run time.
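Finding every process that must be notified when P1 overruns is a transitive walk over the dependency relation. The sketch below uses the dependencies read off Figure 15; the dictionary layout and function name are ours, and data and resource dependencies are not distinguished.

```python
from collections import deque


def affected_processes(depends_on, late_process):
    """All processes that directly or transitively depend on late_process.

    depends_on maps a process to the processes it has to wait for.
    """
    # Invert the relation: producer -> set of consumers.
    dependents = {}
    for proc, preds in depends_on.items():
        for pred in preds:
            dependents.setdefault(pred, set()).add(proc)

    affected, queue = set(), deque([late_process])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected


deps = {"P3": ["P1", "P4"], "P2": ["P1"], "P6": ["P2", "P5"]}
print(sorted(affected_processes(deps, "P1")))  # ['P2', 'P3', 'P6']
```

P6 is reached only through P2, which is why dependents of dependents have to be followed as well.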

If assumption 1 goes wrong, the penalty paid is as follows.

Consider the example in Figure 9, where 2 processes in the ready list are being
scheduled based on PCP. By the PCP method, if Xa > Xb and P1 & P2 do not
share the same resource, then PA is scheduled earlier than PB. We have assumed
that Xa is due to the most probable execution time of Loop P1. But if at run time
Loop P1 executes fewer times than predicted, resulting in Xa being < Xb, then
scheduling PA earlier than PB turns out to be a mistake.
We should be able to calculate the time difference between both possible
schedules.
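One way to quantify that difference is to fix the order chosen with the predicted Xa and then evaluate both orders with the Xa actually observed. The sketch below does this with the PCP expressions from the Figure 9 discussion; the function names and the example run-time values are our own.

```python
def pcp_lengths(ta, tb, xa, xb):
    """(length with PA first, length with PB first) under the PCP rule."""
    return max(ta + xa, ta + tb + xb), max(tb + xb, tb + ta + xa)


def misprediction_penalty(ta, tb, xa_predicted, xa_actual, xb):
    """Extra time paid for keeping the order chosen with the predicted Xa."""
    pred_a, pred_b = pcp_lengths(ta, tb, xa_predicted, xb)
    chose_a = pred_a <= pred_b  # order fixed at compile time
    act_a, act_b = pcp_lengths(ta, tb, xa_actual, xb)
    chosen = act_a if chose_a else act_b
    return chosen - min(act_a, act_b)


# Loop P1 runs shorter than predicted: Xa drops from 7 to 1.
print(misprediction_penalty(3, 2, xa_predicted=7, xa_actual=1, xb=5))  # 3
# Loop P1 runs longer than predicted: the chosen order is still the best.
print(misprediction_penalty(3, 2, xa_predicted=7, xa_actual=9, xb=5))  # 0
```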

We do not at this point propose to repair the schedule because all processes before
P1 have already been executed, and trying to fit another schedule at run time
requires intelligence on the communication network, which is a burden. But on the
brighter side, if at run time Loop P1 executes more times than predicted, then
Xa will still be > Xb. Therefore the assumed schedule holds true.

Case 2: A combination of two loops with one loop feeding data to the other in an
iterative manner.
Solution: Consider PA feeding data to PB in such a manner. For doing static
scheduling, if we unroll the loops and treat them as a set of smaller individual
processes, then it is not possible to assume an unpredictable number of iterations.
Therefore, if an unpredictable number of iterations is assumed in both loops, then
memory footprint could become a serious issue.
But an exception can be made. If both loops at all times run for the same number
of iterations, then the schedule table must initially assume the most probable
number of iterations and schedule PA, PB, PA, PB and so on in a particular column.
In case the prediction is exceeded or fallen short of, then the NSM and LSMs
must do 2 tasks:

1) If the iterations exceed expectations, then all further dependent processes
(data and resource) must be notified of postponement and notified for
scheduling upon the iterations' completion, with the appropriate difference
between the expected and the run-time schedule times. If the iterations fall
short of expectations, then all further schedules need only be brought forward.

2) Since the processes PA and PB should each denote a single iteration in the
table, their entries should be continuously incremented at run time by the NSM
and the LSMs. The increment for one process of course happens a
predetermined number of times, triggered by the scheduling or execution of
the other process. For example, in Figure 16 we see that PA = 10 cycles and
PB = 20 cycles, and hence if both loops run 5 times, then the entry in the column
increments as shown, and so on.

Figure 16 Dynamic entry updates in the NSM and LSMs
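The run-time increments can be sketched as below, taking PA = 10 cycles and assuming PB = 20 cycles, with the two processes strictly alternating in one column. The strict alternation and the start time of 0 are our assumptions; the figure itself may show a different layout.

```python
def entry_updates(pa_cycles, pb_cycles, iterations, start=0):
    """Successive (PA start, PB start) table entries, one per iteration.

    Each iteration shifts both entries by the combined length of PA and PB.
    """
    period = pa_cycles + pb_cycles
    return [
        (start + i * period, start + i * period + pa_cycles)
        for i in range(iterations)
    ]


print(entry_updates(10, 20, 5))
# [(0, 10), (30, 40), (60, 70), (90, 100), (120, 130)]
```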

Only in such a situation can there be preparedness for unpredictable loop iteration
counts.

Case 3: A loop at the macro level, i.e. containing more than a single process.
Solution: In this case, there are some control nodes inside a loop. Hence the
execution time of the loop changes with every iteration.

This is a much more complicated case than the previous ones. Here let's
consider a situation where there is a loop covering 2 mutually exclusive paths,
each path consisting of 2 processes (A,B & C,D) with (3,7 & 15,5) cycle times.
In the schedule table there will be a column to indicate an entry into the loop and
2 columns to indicate the paths inside the loop. Optimality in scheduling inside
the loop can be achieved, but in the global scheme of scheduling, the solution is
non-optimal. But this cannot be helped because, to obtain a globally optimal
solution, all possible paths have to be unrolled and statically scheduled. This
results in a table explosion and is not feasible in situations where an infinite number
of entries in the table is not possible. Hence, from a global viewpoint, the loop and
all its entries are considered as one entity with the most probable number of
iterations considered, and the most expensive path in each iteration is assumed to
be taken. For example, in the above case, path C,D is assumed to be taken all the
time.
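The global estimate for such a loop can be sketched as follows, using the cycle times of the example. The helper name and the iteration count of 4 are illustrative assumptions, and the processes of a path are assumed to run sequentially.

```python
def loop_estimate(paths, probable_iterations):
    """Worst path inside the loop and the loop length seen globally.

    paths maps a path name to the cycle times of its processes.
    """
    path_cost = {name: sum(times) for name, times in paths.items()}
    worst = max(path_cost, key=path_cost.get)
    return worst, path_cost[worst] * probable_iterations


paths = {"A,B": [3, 7], "C,D": [15, 5]}
print(loop_estimate(paths, probable_iterations=4))  # ('C,D', 80)
```

Path A,B costs 10 cycles and path C,D costs 20, so the loop is treated globally as 20 cycles per iteration regardless of which path is actually taken.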

Now, a schedule is prepared for each path and entered into the table under 2
columns. When one schedule is being implemented, the entries for both columns
in the next loop iteration are predicted by adding the completion time of the current
path to both column entries (of course, while doing this, care should be taken not to
overwrite the entries of the current path while they are still being used). Then,
when the current iteration is completed and a fresh one is started, the path is
realized and the appropriate (updated / predicted) table column is chosen to be
loaded from the NSM to the LSMs.

Docket No. 913&-0W6PRV2

ASSIGNMENT 



We, ARAVIND R. DASU, a citizen of India, residing in Tempe, County of Maricopa, State of
Arizona, United States of America, ALI AKOGLU, a citizen of Turkey, residing in Tempe,
County of Maricopa, State of Arizona, and SETHURAMAN PANCHANATHAN, a citizen of
Canada, residing in Gilbert, County of Maricopa, State of Arizona, United States of America,
have made an invention(s) as disclosed in the application for United States patent filed in the
Patent and Trademark Office entitled:

ALGORITHM DESIGN FOR ZONE PATTERN MATCHING TO GENERATE
CLUSTER MODULES AND CONTROL DATA FLOW BASED TASK SCHEDULING
OF THE MODULES

Acting on behalf of Arizona State University, located in Tempe, Arizona, with an office
for conducting business at the Office of Technology Collaborations and Licensing, Arizona State
University, P.O. Box 873511, Tempe, Arizona 85287-3511, the Arizona Board of Regents, a
corporate body organized under Arizona law, wishes to acquire the entire right and title to and
interest in the invention and any improvements on the invention and the above-identified patent
application and any patent that may be obtained for the invention.

We, ARAVIND R. DASU, ALI AKOGLU and SETHURAMAN PANCHANATHAN,
in accordance with the Arizona State University patent policy, for and in consideration of the
sum of $1.00 paid to each of us by the Arizona Board of Regents, and for other valuable
consideration, the receipt of which is acknowledged, have sold, assigned, transferred, and
conveyed and by this document do sell, assign, transfer, and convey, to the Arizona Board of
Regents, its successors and its assigns the entire right and title to, and interest in the invention for
which we have made the above-described application and all improvements on the invention, the
above-identified patent application, any continued prosecution patent application (CPA),
division, continuation or continuation-in-part that claims priority from the above-identified
provisional application, and all United States and foreign patents that issue on the above-
identified invention and improvements, and all United States and foreign patents and patent
applications that claim priority from the above-identified patent application including any patent
extensions, renewals or reissues, all for the full term or terms for which any of the foregoing may
be granted.

We, ARAVIND R. DASU, ALI AKOGLU and SETHURAMAN PANCHANATHAN,
authorize and request the Commissioner of Patents and Trademarks to issue any patents of the
United States resulting from any application described above to the Arizona Board of Regents as
the assignee of the rights identified above, for the sole use and benefit of the Arizona Board of
Regents, its successors and its assigns.

We, ARAVIND R. DASU, ALI AKOGLU and SETHURAMAN PANCHANATHAN,
for the consideration stated above, represent to and agree with the Arizona Board of Regents, its
successors, and its assigns, that we have the full power to make this assignment, and that the
assigned rights are unencumbered by any previously granted rights or license. We, our executors,
or administrators will do all acts and things and execute and deliver without further
compensation any other instruments, further applications, papers, affidavits, powers of attorney,
assignments, and other documents that, in the opinion of the counsel for the Arizona Board of
Regents or its successors and its assigns, are or may be required to more fully secure and vest the
assigned right, title, and interest in and to the Arizona Board of Regents, its successors, and its
assigns, and we will sign any applications for utility patent, applications for reissue,
continuation-in-part applications or applications for extension or renewal that may be desired by
the owner of the patent or patents that may be issued for the invention or improvements on the
invention.



Date: ____________    By: ARAVIND R. DASU

STATE OF            )
                    ) ss.:
COUNTY OF           )

BE IT KNOWN that on this ____ day of ________, 2003, before me personally appeared ARAVIND

Notary Public


Date: ____________    ALI AKOGLU

STATE OF            )
                    ) ss.:
COUNTY OF           )

BE IT KNOWN that on this ____ day of ________, 2003, before me personally appeared ALI
AKOGLU, known to me to be the person mentioned in and who executed the foregoing assignment, and he
acknowledged to me that he executed the same as his free act and deed for the use and purposes therein mentioned.

Notary Public


Date: ____________    SETHURAMAN PANCHANATHAN

STATE OF            )
                    ) ss.:
COUNTY OF           )

BE IT KNOWN that on this ____ day of ________, 2003, before me personally appeared
SETHURAMAN PANCHANATHAN, known to me to be the person mentioned in and who executed the foregoing
assignment, and he acknowledged to me that he executed the same as his free act and deed for the use and purposes
therein mentioned.

Notary Public